-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[ENH] Add Schema to js client #5621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
c51fb45 to
61c4404
Compare
f40f1ea to
e3a6cb8
Compare
61c4404 to
3316d46
Compare
e3a6cb8 to
5439079
Compare
3316d46 to
80453d2
Compare
fe26daa to
ac00bdc
Compare
8bd6af6 to
be61947
Compare
ac00bdc to
a9944bd
Compare
|
Add Schema Support to JS Client with Dense and Sparse Vector Indexing This PR introduces a comprehensive schema system to the JavaScript client for ChromaDB, bringing parity with Python for collection schema configuration, embedding function management, and test coverage. The changes provide a flexible and extensible schema abstraction, allowing users to programmatically define, configure, serialize, and deserialize index types (including sparse and dense vector indexes, inverted indexes, FTS, and more) for each field in a collection. The update includes end-to-end schema serialization tests, dense and sparse vector embedding integration, and logic in the core collection path to utilize schema-aware embedding function resolution and sparse auto-embedding. Key Changes• Introduced Affected Areas• This summary was automatically generated by @propel-code-bot |
| data.map(async (collection) => | ||
| new CollectionImpl({ | ||
| chromaClient: this, | ||
| apiClient: this.apiClient, | ||
| name: collection.name, | ||
| id: collection.id, | ||
| embeddingFunction: await getEmbeddingFunction( | ||
| collection.name, | ||
| collection.configuration_json.embedding_function ?? undefined, | ||
| ), | ||
| configuration: collection.configuration_json, | ||
| metadata: collection.metadata ?? undefined, | ||
| schema: Schema.deserializeFromJSON(collection.schema ?? undefined), | ||
| }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The logic for creating a CollectionImpl instance from an API response is repeated in listCollections, createCollection, getCollection, and getOrCreateCollection. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method, for example _collectionFromResponseData(data, embeddingFunction?). This would centralize the construction of collection objects from API data.
ChromaDB Best Practice: Following the official ChromaDB JavaScript client patterns, API responses should be consistently transformed into Collection objects. A helper method ensures consistent handling of the API response structure and metadata across all collection operations.
Context for Agents
[**BestPractice**]
The logic for creating a `CollectionImpl` instance from an API response is repeated in `listCollections`, `createCollection`, `getCollection`, and `getOrCreateCollection`. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method, for example `_collectionFromResponseData(data, embeddingFunction?)`. This would centralize the construction of collection objects from API data.
**ChromaDB Best Practice**: Following the official ChromaDB JavaScript client patterns, API responses should be consistently transformed into Collection objects. A helper method ensures consistent handling of the API response structure and metadata across all collection operations.
File: clients/new-js/packages/chromadb/src/chroma-client.ts
Line: 232be61947 to
9322a2c
Compare
a9944bd to
2bb4533
Compare
| @@ -0,0 +1,1002 @@ | |||
| import type { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The new schema.ts file is quite large and contains many distinct concepts (constants, config classes, type classes, utility functions, and the main Schema class). For better maintainability and code organization, consider splitting this file into smaller, more focused modules.
For example, you could structure it like this:
src/schema/constants.tssrc/schema/index-configs.tssrc/schema/value-types.tssrc/schema/utils.tssrc/schema/schema.ts(for the main class)src/schema/index.ts(to re-export everything)
This would make the code easier to navigate and understand.
Context for Agents
[**BestPractice**]
The new `schema.ts` file is quite large and contains many distinct concepts (constants, config classes, type classes, utility functions, and the main `Schema` class). For better maintainability and code organization, consider splitting this file into smaller, more focused modules.
For example, you could structure it like this:
- `src/schema/constants.ts`
- `src/schema/index-configs.ts`
- `src/schema/value-types.ts`
- `src/schema/utils.ts`
- `src/schema/schema.ts` (for the main class)
- `src/schema/index.ts` (to re-export everything)
This would make the code easier to navigate and understand.
File: clients/new-js/packages/chromadb/src/schema.ts
Line: 19322a2c to
8e1d146
Compare
d567604 to
e636236
Compare
| }); | ||
|
|
||
| // Generate embeddings for all collected documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Array length validation missing: If sparseEmbeddings.length !== positions.length, the error is thrown but the actual values aren't included in the error message, making debugging difficult.
Suggested Change
| }); | |
| // Generate embeddings for all collected documents | |
| if (sparseEmbeddings.length !== positions.length) { | |
| throw new ChromaValueError( | |
| `Sparse embedding function returned unexpected number of embeddings. Expected ${positions.length}, got ${sparseEmbeddings.length}`, | |
| ); | |
| } |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
Array length validation missing: If `sparseEmbeddings.length !== positions.length`, the error is thrown but the actual values aren't included in the error message, making debugging difficult.
<details>
<summary>Suggested Change</summary>
```suggestion
if (sparseEmbeddings.length !== positions.length) {
throw new ChromaValueError(
`Sparse embedding function returned unexpected number of embeddings. Expected ${positions.length}, got ${sparseEmbeddings.length}`,
);
}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: clients/new-js/packages/chromadb/src/collection.ts
Line: 415|
|
||
| const sparseEmbeddings = await this.sparseEmbed(embeddingFunction, inputs, false); | ||
| if (sparseEmbeddings.length !== positions.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Same issue: include actual vs expected values in error message for better debugging.
Suggested Change
| const sparseEmbeddings = await this.sparseEmbed(embeddingFunction, inputs, false); | |
| if (sparseEmbeddings.length !== positions.length) { | |
| if (sparseEmbeddings.length !== positions.length) { | |
| throw new ChromaValueError( | |
| `Sparse embedding function returned unexpected number of embeddings. Expected ${positions.length}, got ${sparseEmbeddings.length}`, | |
| ); | |
| } |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
Same issue: include actual vs expected values in error message for better debugging.
<details>
<summary>Suggested Change</summary>
```suggestion
if (sparseEmbeddings.length !== positions.length) {
throw new ChromaValueError(
`Sparse embedding function returned unexpected number of embeddings. Expected ${positions.length}, got ${sparseEmbeddings.length}`,
);
}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: clients/new-js/packages/chromadb/src/collection.ts
Line: 452| private async applySparseEmbeddingsToMetadatas( | ||
| metadatas?: Metadata[], | ||
| documents?: string[], | ||
| ): Promise<Metadata[] | undefined> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
This function is quite long and contains duplicated logic for handling sourceKey === DOCUMENT_KEY and the general case where sourceKey is a metadata field. The core logic of collecting inputs, generating embeddings, and updating metadata is nearly identical in both branches.
Consider extracting this shared logic into a helper function to reduce duplication and improve readability. The helper could take parameters like targetKey, config, updatedMetadatas, and documentsList and handle the embedding process for a single sparse target. This would make applySparseEmbeddingsToMetadatas much simpler, as it would then be primarily responsible for iterating through sparseTargets and calling the helper.
Context for Agents
[**BestPractice**]
This function is quite long and contains duplicated logic for handling `sourceKey === DOCUMENT_KEY` and the general case where `sourceKey` is a metadata field. The core logic of collecting inputs, generating embeddings, and updating metadata is nearly identical in both branches.
Consider extracting this shared logic into a helper function to reduce duplication and improve readability. The helper could take parameters like `targetKey`, `config`, `updatedMetadatas`, and `documentsList` and handle the embedding process for a single sparse target. This would make `applySparseEmbeddingsToMetadatas` much simpler, as it would then be primarily responsible for iterating through `sparseTargets` and calling the helper.
File: clients/new-js/packages/chromadb/src/collection.ts
Line: 362517106f to
c2c379d
Compare
8bc1124 to
04d4b3f
Compare
c2c379d to
d6914e9
Compare
04d4b3f to
094df60
Compare
d6914e9 to
9e020d9
Compare
094df60 to
9920e75
Compare
| const resolveSchemaEmbeddingFunction = ( | ||
| schema: Schema | undefined, | ||
| ): EmbeddingFunction | undefined => { | ||
| if (!schema) { | ||
| return undefined; | ||
| } | ||
|
|
||
| const embeddingOverride = | ||
| schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config.embeddingFunction ?? undefined; | ||
| if (embeddingOverride) { | ||
| return embeddingOverride; | ||
| } | ||
|
|
||
| return schema.defaults.floatList?.vectorIndex?.config.embeddingFunction ?? undefined; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The logic in resolveSchemaEmbeddingFunction is duplicated in clients/new-js/packages/chromadb/src/collection.ts (lines 473-487) as getSchemaEmbeddingFunction. To adhere to the DRY principle, this logic should be extracted into a single, shared function. A good location for this would be the new schema.ts file, as it's directly related to schema interpretation and follows ChromaDB's TypeScript architectural patterns.
Additionally, the use of ?? undefined is redundant since optional chaining (?.) already results in undefined if any part of the chain is nullish.
Here's a suggested implementation for a shared function in schema.ts:
// in clients/new-js/packages/chromadb/src/schema.ts
export const resolveSchemaEmbeddingFunction = (
schema: Schema | undefined,
): EmbeddingFunction | undefined => {
if (!schema) {
return undefined;
}
return (
schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config.embeddingFunction ??
schema.defaults.floatList?.vectorIndex?.config.embeddingFunction
);
};This refactoring aligns with ChromaDB's TypeScript patterns where embedding functions are modular and reusable components. The new function can then be imported and used in both chroma-client.ts and collection.ts, removing the duplicated implementations and improving maintainability.
Context for Agents
[**BestPractice**]
The logic in `resolveSchemaEmbeddingFunction` is duplicated in `clients/new-js/packages/chromadb/src/collection.ts` (lines 473-487) as `getSchemaEmbeddingFunction`. To adhere to the DRY principle, this logic should be extracted into a single, shared function. A good location for this would be the new `schema.ts` file, as it's directly related to schema interpretation and follows ChromaDB's TypeScript architectural patterns.
Additionally, the use of `?? undefined` is redundant since optional chaining (`?.`) already results in `undefined` if any part of the chain is nullish.
Here's a suggested implementation for a shared function in `schema.ts`:
```typescript
// in clients/new-js/packages/chromadb/src/schema.ts
export const resolveSchemaEmbeddingFunction = (
schema: Schema | undefined,
): EmbeddingFunction | undefined => {
if (!schema) {
return undefined;
}
return (
schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config.embeddingFunction ??
schema.defaults.floatList?.vectorIndex?.config.embeddingFunction
);
};
```
This refactoring aligns with ChromaDB's TypeScript patterns where embedding functions are modular and reusable components. The new function can then be imported and used in both `chroma-client.ts` and `collection.ts`, removing the duplicated implementations and improving maintainability.
File: clients/new-js/packages/chromadb/src/chroma-client.ts
Line: 41|
|
||
| // Create copies, converting null to empty object | ||
| const updatedMetadatas = metadatas.map((metadata) => | ||
| metadata !== null && metadata !== undefined ? { ...metadata } : {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
This line can be simplified. The check metadata !== null && metadata !== undefined can be shortened to metadata != null, which checks for both null and undefined.
| metadata !== null && metadata !== undefined ? { ...metadata } : {} | |
| metadata != null ? { ...metadata } : {} |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
This line can be simplified. The check `metadata !== null && metadata !== undefined` can be shortened to `metadata != null`, which checks for both `null` and `undefined`.
```suggestion
metadata != null ? { ...metadata } : {}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
File: clients/new-js/packages/chromadb/src/collection.ts
Line: 378| export const getSparseEmbeddingFunction = ( | ||
| collectionName: string, | ||
| efConfig?: EmbeddingFunctionConfiguration, | ||
| ) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The console.warn calls were removed from this function for cases where efConfig is missing or has type legacy. However, getEmbeddingFunction still includes these warnings. For a better developer experience and consistency, it would be helpful to add the warnings back to getSparseEmbeddingFunction. This would alert developers when a sparse embedding function is configured but cannot be instantiated, which could otherwise lead to silent failures or unexpected behavior.
ChromaDB Context: Both dense and sparse embedding functions in ChromaDB follow similar configuration patterns. Consistent warning messages across both types help developers quickly identify and resolve embedding function configuration issues, especially when working with legacy configurations or missing parameters.
Context for Agents
[**BestPractice**]
The `console.warn` calls were removed from this function for cases where `efConfig` is missing or has type `legacy`. However, `getEmbeddingFunction` still includes these warnings. For a better developer experience and consistency, it would be helpful to add the warnings back to `getSparseEmbeddingFunction`. This would alert developers when a sparse embedding function is configured but cannot be instantiated, which could otherwise lead to silent failures or unexpected behavior.
**ChromaDB Context**: Both dense and sparse embedding functions in ChromaDB follow similar configuration patterns. Consistent warning messages across both types help developers quickly identify and resolve embedding function configuration issues, especially when working with legacy configurations or missing parameters.
File: clients/new-js/packages/chromadb/src/embedding-function.ts
Line: 2489920e75 to
c798a6c
Compare

Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
added schema unit tests matching python ones
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?